Load data; adjust the file path as needed.
## load the packages used below
library(dplyr)       ## pipes and data verbs
library(tidytext)    ## unnest_tokens, get_sentiments, stop_words
library(scales)      ## rescale
library(plm)         ## panel fixed-effects models
reviews2.csv <- read.csv('~/Dropbox/Eugenie/data/arslan-reviews2.csv')
Turn numeric values to factors.
## turn numeric values to factors
reviews2.csv$is_deleted <- as.factor(reviews2.csv$is_deleted)
reviews2.csv$incentivized <- as.factor(reviews2.csv$incentivized)
reviews2.csv$verified_purchaser <- as.factor(reviews2.csv$verified_purchaser)
## change level values
levels(reviews2.csv$verified_purchaser) <- c("unverified", "verified")
levels(reviews2.csv$incentivized) <- c("non-incentivized", "incentivized")
levels(reviews2.csv$is_deleted) <- c("kept", "deleted")
## get relevant columns
cols <- c('recid', 'item_id', 'user_id', 'text')
reviews2.text <- as.data.frame(reviews2.csv[, cols])
## turn numeric values to factors
reviews2.text$recid <- as.factor(reviews2.text$recid)
## turn factors to char vectors for tidy unnest_tokens
reviews2.text$text <- as.character(reviews2.text$text)
## get tidy tokens for each review record
tidy.reviews2.text <- reviews2.text %>%
unnest_tokens(word, text)
## remove stop words
data(stop_words)
tidy.reviews2.text <- tidy.reviews2.text %>%
anti_join(stop_words)
## Joining, by = "word"
Words with over 20,000 appearances in reviews as a whole.
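The code that produced this frequency list is not shown above; a minimal sketch using the tidy tokens (dplyr's count, with the 20,000 cutoff mentioned in the text) could be:

```r
## words appearing more than 20,000 times across all reviews
tidy.reviews2.text %>%
  count(word, sort = TRUE) %>%
  filter(n > 20000)
```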
## inner join with afinn sentiments
afinn.reviews2 <- tidy.reviews2.text %>%
inner_join(get_sentiments('afinn'))
## Joining, by = "word"
## sum the sentiment of words by record
afinn.reviews2 <- afinn.reviews2 %>%
group_by(recid) %>%
mutate(word.count=n()) %>%
mutate(afinn.sentiment=sum(value)) %>%
mutate(method='AFINN')
Get an index ranging from -1 to 1 in two ways: by dividing by the word count per review, and with the rescale function from the scales package.
## scale to -1 to 1 index, afinn sentiment is calculated on a -5 to 5 scale
afinn.reviews2 <- afinn.reviews2 %>%
mutate(afinn.index=(afinn.sentiment/word.count)/5)
## scale to -1 to 1 index with rescale function from package scales
afinn.reviews2$afinn.sentiment.std <- rescale(afinn.reviews2$afinn.sentiment, to=c(-1,1))
Left join to selected columns of the original reviews2.csv data frame.
reviews2.sentiment <- merge(reviews2.csv[,c('recid', 'item_id', 'rating', 'incentivized', 'is_deleted', 'verified_purchaser', 'text', 'title')],
afinn.reviews2[,c('recid', 'afinn.sentiment', 'afinn.index', 'afinn.sentiment.std')], by='recid', all.x = T)
Observe missing values: 86,205 records don’t have an Afinn sentiment index.
summary(reviews2.sentiment$afinn.index)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.00 -0.03 0.20 0.16 0.40 1.00 86205
reviews2.sentiment[, c('incentivized','afinn.index','afinn.sentiment')] %>%
group_by(incentivized) %>%
summarize_all(mean, na.rm = TRUE)
## # A tibble: 2 x 3
## incentivized afinn.index afinn.sentiment
## <fct> <dbl> <dbl>
## 1 non-incentivized 0.159 2.49
## 2 incentivized 0.203 11.9
These summary statistics are quite surprising, and I wonder whether I made a mistake in any of the steps (e.g., in normalizing the sentiment scores to an index).
There is a big difference in the Afinn sentiment score between non-incentivized and incentivized reviews.
However, the difference is much less pronounced in the normalized index between the two groups.
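One hypothetical diagnostic for this gap (not part of the original analysis): if incentivized reviews are simply longer, the summed Afinn score inflates with length while the per-word index does not. The sketch below reuses the objects defined above and assumes recid is numeric in reviews2.csv.

```r
## compare matched-word counts and raw summed scores per group
afinn.reviews2 %>%
  ungroup() %>%
  distinct(recid, word.count, afinn.sentiment) %>%
  mutate(recid = as.numeric(as.character(recid))) %>%
  inner_join(reviews2.csv[, c('recid', 'incentivized')], by = 'recid') %>%
  group_by(incentivized) %>%
  summarize(mean.matched.words = mean(word.count),
            mean.raw.score = mean(afinn.sentiment))
```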
Remove records with any NA, so only ~60% of the data is available for plotting.
reviews2.sentiment.all <- na.omit(reviews2.sentiment)
Boxplot: afinn sentiment score vs. rating
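The plotting code is omitted from this chunk; a minimal sketch with ggplot2 (assumed installed) could look like:

```r
library(ggplot2)
ggplot(reviews2.sentiment.all, aes(x = factor(rating), y = afinn.sentiment)) +
  geom_boxplot() +
  labs(x = 'rating', y = 'Afinn sentiment score')
```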
This record is the extreme outlier in the boxplot. Its text has 2,476 words, the longest review in the dataset. It has a rating of 5 but receives a low sentiment index under all three lexicons, so it may need additional investigation/processing.
reviews2.csv[reviews2.csv$recid == 35676004, c('recid', 'item_id', 'rating', 'incentivized', 'is_deleted', 'verified_purchaser', 'title', 'word_count')]
## recid item_id rating incentivized is_deleted
## 22625 35676004 B01422TC14 5 incentivized deleted
## verified_purchaser
## 22625 unverified
## title
## 22625 Something for the weekend sir?. A 79.01%* efficient power bank that keeps on giving.
## word_count
## 22625 2476
Boxplot: afinn sentiment index vs. rating
Use the Afinn sentiment score in place of rating
# in model.fe, index = c('item_id') defines 'item_id' as the entity
formula.fe <- afinn.sentiment ~ incentivized + is_deleted + verified_purchaser
model.fe <- plm(data = reviews2.sentiment.all, formula = formula.fe, index = c('item_id'), model = 'within')
# get the model summary
summary(model.fe)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = formula.fe, data = reviews2.sentiment.all, model = "within",
## index = c("item_id"))
##
## Unbalanced Panel: n = 101, T = 18-20532, N = 411039
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -144.697026 -3.173747 -0.058654 2.754540 44.176404
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## incentivizedincentivized 6.706639 0.059600 112.528 < 2.2e-16 ***
## is_deleteddeleted 1.201850 0.036068 33.322 < 2.2e-16 ***
## verified_purchaserverified -1.614790 0.034614 -46.652 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 15475000
## Residual Sum of Squares: 14200000
## R-Squared: 0.082417
## Adj. R-Squared: 0.082187
## F-statistic: 12303.4 on 3 and 410935 DF, p-value: < 2.22e-16
Use the Afinn sentiment index in place of rating
# in model.fe, index = c('item_id') defines 'item_id' as the entity
formula.fe <- afinn.index ~ incentivized + is_deleted + verified_purchaser
model.fe <- plm(data = reviews2.sentiment.all, formula = formula.fe, index = c('item_id'), model = 'within')
# get the model summary
summary(model.fe)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = formula.fe, data = reviews2.sentiment.all, model = "within",
## index = c("item_id"))
##
## Unbalanced Panel: n = 101, T = 18-20532, N = 411039
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -1.2398660 -0.1846995 0.0096222 0.2053119 1.0659298
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## incentivizedincentivized 0.0325484 0.0030401 10.7064 < 2e-16 ***
## is_deleteddeleted 0.0230638 0.0018397 12.5364 < 2e-16 ***
## verified_purchaserverified 0.0044500 0.0017656 2.5204 0.01172 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 36993
## Residual Sum of Squares: 36946
## R-Squared: 0.0012541
## Adj. R-Squared: 0.0010037
## F-statistic: 171.994 on 3 and 410935 DF, p-value: < 2.22e-16
## inner join with bing sentiments
bing.reviews2 <- tidy.reviews2.text %>%
inner_join(get_sentiments('bing'))
## Joining, by = "word"
## get sentiments of records by counting the number of positive vs. negative words per record
bing.reviews2 <- bing.reviews2 %>%
group_by(recid) %>%
summarise(positive.count=sum(sentiment=='positive'),
negative.count=sum(sentiment=='negative'))
bing.reviews2 <- bing.reviews2 %>%
mutate(bing.sentiment=positive.count-negative.count) %>%
mutate(word.count=positive.count+negative.count) %>%
mutate(method='BING')
Get an index ranging from -1 to 1 using the word count per review (the Bing sentiment is on a binary positive/negative scale), and also with the rescale function from the scales package.
## scale to -1 to 1 index based on word count per review
bing.reviews2 <- bing.reviews2 %>%
mutate(bing.index=bing.sentiment/word.count)
## scale to -1 to 1 index
bing.reviews2$bing.sentiment.std <- rescale(bing.reviews2$bing.sentiment, to=c(-1,1))
## observe outliers, which make the rescale-based standardization less helpful
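A quick way to see those outliers (a sketch, not in the original code):

```r
summary(bing.reviews2$bing.sentiment)
## records with the most extreme raw scores
bing.reviews2 %>%
  arrange(desc(abs(bing.sentiment))) %>%
  head()
```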
Left join selected Bing columns to the reviews2.sentiment data frame.
## left join
reviews2.sentiment <- merge(reviews2.sentiment, bing.reviews2[,c('recid', 'bing.sentiment', 'bing.index', 'bing.sentiment.std')], by='recid', all.x = T)
## deduplicate
reviews2.sentiment <- reviews2.sentiment[!duplicated(reviews2.sentiment), ]
Observe missing values: 72,728 records don’t have a Bing sentiment index.
summary(reviews2.sentiment$bing.index)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.00 0.00 1.00 0.45 1.00 1.00 72728
reviews2.sentiment[, c('incentivized','bing.index','bing.sentiment')] %>%
group_by(incentivized) %>%
summarize_all(mean, na.rm = TRUE)
## # A tibble: 2 x 3
## incentivized bing.index bing.sentiment
## <fct> <dbl> <dbl>
## 1 non-incentivized 0.448 0.911
## 2 incentivized 0.560 5.47
There is a big difference in both the Bing sentiment score and the Bing index between non-incentivized and incentivized reviews.
Remove records with any NA, so only ~60% of the data is available for plotting.
reviews2.sentiment.all <- na.omit(reviews2.sentiment)
Boxplot: bing sentiment score vs. rating
Boxplot: bing sentiment index vs. rating
Use the Bing sentiment score in place of rating
# in model.fe, index = c('item_id') defines 'item_id' as the entity
formula.fe <- bing.sentiment ~ incentivized + is_deleted + verified_purchaser
model.fe <- plm(data = reviews2.sentiment.all, formula = formula.fe, index = c('item_id'), model = 'within')
# get the model summary
summary(model.fe)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = formula.fe, data = reviews2.sentiment.all, model = "within",
## index = c("item_id"))
##
## Unbalanced Panel: n = 101, T = 10-6697, N = 167323
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -19.26536 -0.84110 0.16953 0.97295 40.11604
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## incentivizedincentivized 3.920282 0.050193 78.104 < 2.2e-16 ***
## is_deleteddeleted 0.339129 0.019828 17.104 < 2.2e-16 ***
## verified_purchaserverified -0.300151 0.019834 -15.133 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 662620
## Residual Sum of Squares: 625440
## R-Squared: 0.056123
## Adj. R-Squared: 0.055542
## F-statistic: 3314.29 on 3 and 167219 DF, p-value: < 2.22e-16
Use the Bing sentiment index in place of rating
# in model.fe, index = c('item_id') defines 'item_id' as the entity
formula.fe <- bing.index ~ incentivized + is_deleted + verified_purchaser
model.fe <- plm(data = reviews2.sentiment.all, formula = formula.fe, index = c('item_id'), model = 'within')
# get the model summary
summary(model.fe)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = formula.fe, data = reviews2.sentiment.all, model = "within",
## index = c("item_id"))
##
## Unbalanced Panel: n = 101, T = 10-6697, N = 167323
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -1.71628 -0.44976 0.37199 0.50104 1.20615
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## incentivizedincentivized 0.1207862 0.0176618 6.8389 8.010e-12 ***
## is_deleteddeleted 0.0559708 0.0069769 8.0223 1.044e-15 ***
## verified_purchaserverified 0.0542145 0.0069790 7.7682 8.003e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 77520
## Residual Sum of Squares: 77439
## R-Squared: 0.0010419
## Adj. R-Squared: 0.00042656
## F-statistic: 58.1347 on 3 and 167219 DF, p-value: < 2.22e-16
loughran.pos.neg <- get_sentiments("loughran") %>%
filter(sentiment %in% c("positive", "negative"))
## inner join with loughran sentiments
loughran.reviews2 <- tidy.reviews2.text %>%
inner_join(loughran.pos.neg)
## Joining, by = "word"
## get sentiments of records by counting the number of positive vs. negative words per record
loughran.reviews2 <- loughran.reviews2 %>%
group_by(recid) %>%
summarise(positive.count=sum(sentiment=='positive'),
negative.count=sum(sentiment=='negative'))
loughran.reviews2 <- loughran.reviews2 %>%
mutate(loughran.sentiment=positive.count-negative.count) %>%
mutate(word.count=positive.count+negative.count) %>%
mutate(method='LOUGHRAN')
Get an index ranging from -1 to 1 using the word count per review (the Loughran sentiment is on a binary positive/negative scale), and also with the rescale function from the scales package.
## scale to -1 to 1 index based on word count per review
loughran.reviews2 <- loughran.reviews2 %>%
mutate(loughran.index=loughran.sentiment/word.count)
## scale to -1 to 1 index
loughran.reviews2$loughran.sentiment.std <- rescale(loughran.reviews2$loughran.sentiment, to=c(-1,1))
## observe outliers, which make the rescale-based standardization less helpful
Left join selected Loughran columns to the reviews2.sentiment data frame.
## left join
reviews2.sentiment <- merge(reviews2.sentiment, loughran.reviews2[,c('recid', 'loughran.sentiment', 'loughran.index', 'loughran.sentiment.std')], by='recid', all.x = T)
## deduplicate
reviews2.sentiment <- reviews2.sentiment[!duplicated(reviews2.sentiment), ]
Observe missing values: 149,126 records don’t have a Loughran sentiment index, roughly twice the number of NAs under the Bing and Afinn lexicons.
summary(reviews2.sentiment$loughran.index)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.00 -1.00 1.00 0.23 1.00 1.00 149126
reviews2.sentiment[, c('incentivized','loughran.index','loughran.sentiment')] %>%
group_by(incentivized) %>%
summarize_all(mean, na.rm = TRUE)
## # A tibble: 2 x 3
## incentivized loughran.index loughran.sentiment
## <fct> <dbl> <dbl>
## 1 non-incentivized 0.233 0.234
## 2 incentivized 0.237 0.659
There is only a slight difference in the Loughran sentiment index between non-incentivized and incentivized reviews, although the raw Loughran score is still noticeably higher for incentivized reviews.
Remove records with any NA, so only ~45% of the data is available for plotting.
reviews2.sentiment.all <- na.omit(reviews2.sentiment)
Boxplot: Loughran sentiment score vs. rating
Boxplot: Loughran sentiment index vs. rating
Use the Loughran sentiment score in place of rating
# in model.fe, index = c('item_id') defines 'item_id' as the entity
formula.fe <- loughran.sentiment ~ incentivized + is_deleted + verified_purchaser
model.fe <- plm(data = reviews2.sentiment.all, formula = formula.fe, index = c('item_id'), model = 'within')
# get the model summary
summary(model.fe)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = formula.fe, data = reviews2.sentiment.all, model = "within",
## index = c("item_id"))
##
## Unbalanced Panel: n = 101, T = 5-4736, N = 107208
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -23.46814 -1.03705 0.43128 0.84801 11.21556
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## incentivizedincentivized 0.242001 0.040223 6.0165 1.788e-09 ***
## is_deleteddeleted 0.158052 0.018247 8.6616 < 2.2e-16 ***
## verified_purchaserverified 0.019979 0.017708 1.1283 0.2592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 211700
## Residual Sum of Squares: 211360
## R-Squared: 0.0015982
## Adj. R-Squared: 0.00063803
## F-statistic: 57.1484 on 3 and 107104 DF, p-value: < 2.22e-16
Use the Loughran sentiment index in place of rating
# in model.fe, index = c('item_id') defines 'item_id' as the entity
formula.fe <- loughran.index ~ incentivized + is_deleted + verified_purchaser
model.fe <- plm(data = reviews2.sentiment.all, formula = formula.fe, index = c('item_id'), model = 'within')
# get the model summary
summary(model.fe)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = formula.fe, data = reviews2.sentiment.all, model = "within",
## index = c("item_id"))
##
## Unbalanced Panel: n = 101, T = 5-4736, N = 107208
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -1.76389 -0.96487 0.48665 0.69911 1.50394
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## incentivizedincentivized -0.032729 0.024253 -1.3495 0.1772
## is_deleteddeleted 0.072872 0.011003 6.6232 3.531e-11 ***
## verified_purchaserverified 0.068195 0.010677 6.3870 1.698e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 76899
## Residual Sum of Squares: 76842
## R-Squared: 0.00074249
## Adj. R-Squared: -0.00021847
## F-statistic: 26.5277 on 3 and 107104 DF, p-value: < 2.22e-16
Observe that roughly 23% of all records are missing sentiment scores under all three lexicons.
## afinn.sentiment bing.sentiment loughran.sentiment Freq Freq_pct
## 1 FALSE FALSE FALSE 107208 40.5281880
## 2 TRUE FALSE FALSE 4989 1.8860079
## 3 FALSE TRUE FALSE 2168 0.8195761
## 4 TRUE TRUE FALSE 1036 0.3916424
## 5 FALSE FALSE TRUE 60115 22.7254685
## 6 TRUE FALSE TRUE 19487 7.3667338
## 7 FALSE TRUE TRUE 8831 3.3384116
## 8 TRUE TRUE TRUE 60693 22.9439717
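The code behind this table is not shown; one way to build it (a sketch, assuming TRUE marks a missing score in the merged reviews2.sentiment frame) is:

```r
## flag missing lexicon scores per record and cross-tabulate
na.flags <- data.frame(
  afinn.sentiment    = is.na(reviews2.sentiment$afinn.sentiment),
  bing.sentiment     = is.na(reviews2.sentiment$bing.sentiment),
  loughran.sentiment = is.na(reviews2.sentiment$loughran.sentiment))
na.table <- as.data.frame(table(na.flags))
na.table$Freq_pct <- 100 * na.table$Freq / sum(na.table$Freq)
na.table
```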